International Cherry Blossom Prediction Competition

Exploratory Data Analysis 2 and Feature Engineering

Author

Edimer David Jaramillo

Published

February 29, 2024

Document description

  • In this document, exploratory data analysis is oriented to variables that change over time. In the document 05-EDA1.qmd several of the results make use of the response variable bloom_doy summarized with the median, in this document the objective is to explore the temporal variation of the response variable and the association with climatic variables (df_weather) and photoperiod (df_photoperiod).

Libraries and setup

Code
# Libraries
library(tidyverse)
library(arrow)
library(glue)
library(reactable)
library(latex2exp)
library(splines)
library(forecast)
library(plotly)

# Colors
colors_custom <-
  c("#014e25",
    "#800080",
    "#ffa500",
    "#008080",
    "#ff6347",
    "#0000cd")

# Theme ggplot2
theme_set(theme_bw() + theme(legend.position = "top"))

# Functions

## These functions are not required in this document
functions_exclude <-
  c(
    "../source/r/module-EDA",
    "../source/r/extractPhotoperiod.R",
    "../source/r/getWeatherPOWER.R"
  )

dir_functions <-
  fs::dir_ls("../source/r/")

c(dir_functions[!dir_functions %in% functions_exclude],
  fs::dir_ls("../source/r/module-EDA/")) |>
  walk(.f = source)

# Inputs
countries <- c("Japan", "Switzerland", "South Korea", "USA-WDC")

Databases

  • In total there are 490 different coordinates.
  • df_full: complete data with locations (location and country), coordinates and response variable. I select only the variables of interest for this analysis.
    • Total rows: 17966
    • Total columns: 8
  • df_weather: weather data for each coordinate
    • Total rows: 7738651
    • Total columns: 17
    • Date range: 1981-01-01 until 2024-02-25
  • df_photoperiod: photoperiod data for each coordinate.
    • Total rows: 9456891
    • Total columns: 3
    • Date range: 0811-04-02 until 2023-12-30
  • From the information provided by NPN (df_npn) I take the Site_ID == 32789 and the Species_ID == 228 to identify the records of the site of interest in New York.

Data structuring and initial Feature Engineering (FE)

  • In the database data_bloom_year I keep the coordinates of interest and the year of registration.
  • I obtain a table with summary metrics (data_summary_weather) per year (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the average, median, messtandard deviation, minimum and maximum. With this table it is possible to relate climatic information from the same flowering year or information from delayed years. Finally I create a database (data_weather_yearly) with the response variable bloom_doy, the coordinates, the year and the climate information. This table could also be used for model building.
  • I obtain a table with summary metrics (data_summary_weather_monthly) by year and month (since 1981) for each coordinate and for each of the climate variables, the calculated metrics were the average, median, messtandard deviation, minimum and maximum. With this table it is possible to relate climatic information from the same year and month of flowering or information from delayed years and months. Finally I create a database (data_weather_yearly) with the response variable bloom_doy, the coordinates, the year, month and the climate information. This table could also be used for model building.
  • Having climatic variables summarized on a monthly basis has advantages and disadvantages. Personally I think that the greatest advantage comes from the level of detail in the climate profile of the locations, however, since there are 9 climate variables, three summary metrics (average, median and standard deviation) and 12 months implies that 324 new variables are generated, to these variables are added the 12 that I obtain with the variable FROST_DAYS for a total of 336 possible predictors. As many of these variables are correlated, using multivariate techniques for dimensionality reduction allows improving characterization in a smaller dimensional space.
  • Although I have photoperiod information from the year 81 (one year ago from the first date at each latitude), taking into account that the climate information is from 1981, I also apply this filter to the photoperiod records. I obtain three summary metrics (average, median and standard deviation) for the photoperiod by year (data_photop_coords_yearly) and by year-month (data_photop_coords_monthly), with which 36 new variables are generated.
Code
# ------ DATA ------------------------------

# Data with coordinates and response variable
df_full_complete <-
  read_parquet("../external-data/df_full.parquet") |>
  mutate(
    location = if_else(
      location == "Switzerland/Z\xfcrich-Albisg\xfcetli",
      true = "Switzerland/Zürich-Albisgüetli",
      false = location
    ),
    location = if_else(
      location == "Switzerland/M\xf6hlin",
      true = "Switzerland/Möhlin",
      false = location
    ),
    location = if_else(
      location == "Switzerland/W\xe4denswil",
      true = "Switzerland/Wädenswil",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Z\xfcrich-MeteoSchweiz",
      true = "Switzerland/Zürich-MeteoSchweiz",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Z\xfcrich-Witikon",
      true = "Switzerland/Zürich-Witikon",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Gr\xfcsch",
      true = "Switzerland/Grüsch",
      false = location
    ),
    location = if_else(
      location == "Switzerland/N\xe4fels",
      true = "Switzerland/Näfels",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Sch\xf6nenwerd",
      true = "Switzerland/Schönenwerd",
      false = location
    ),
    location = if_else(
      location == "Switzerland/D\xf6ttingen",
      true = "Switzerland/Döttingen",
      false = location
    ),
    location = if_else(
      location == "Switzerland/H\xf6fen",
      true = "Switzerland/Höfen",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Le S\xe9pey",
      true = "Switzerland/Le Sépey",
      false = location
    ),
    location = if_else(
      location == "Switzerland/La Br\xe9vine",
      true = "Switzerland/La Brévine",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Alchenfl\xfch",
      true = "Switzerland/Alchenflüh",
      false = location
    ),
    location = if_else(
      location == "Switzerland/Neuch\xe2tel",
      true = "Switzerland/Neuchâtel",
      false = location
    )
  )

df_full <-
  df_full_complete |>
  select(location, country, lat, long, alt, year, bloom_date, bloom_doy) |>
  distinct_all()

# Climate data
df_weather <- read_parquet("../external-data/df_weather.parquet")

# Photoperiod data
df_photoperiod <- read_parquet("../external-data/df_photoperiod.parquet")

# Data NPN
df_npn <- read_parquet("../external-data/df_npn_usa.parquet")

# ------ Coordinates to predict ------------------------------

# Coordinates Kyoto
coord_kyoto <-
  df_full |>
  filter(str_detect(location, "kyoto|Kyoto|KYOTO")) |>
  distinct(lat, long) |>
  slice(2) |>
  mutate(location = "Kyoto (Japan)") |>
  relocate(location, everything())

# Coordinates Liestal
coord_liestal <-
  df_full |>
  filter(str_detect(location, "liestal")) |>
  distinct(lat, long) |>
  mutate(location = "Liestal-Weideli (Swtizerland)") |>
  relocate(location, everything())

# Coordinates USA WDC
coord_usa_wdc <-
  df_full |>
  filter(str_detect(country, "USA-WDC")) |>
  distinct(lat, long) |>
  mutate(location = "Washington, D.C. (USA)") |>
  relocate(location, everything())

# Coordinates Vancouver
coord_vancouver <-
  df_weather |>
  filter(lat >= 49.22 & lat <= 49.3) |>
  distinct(lat, long) |>
  mutate(location = "Vancouver, BC (Canada)") |>
  relocate(location, everything())

# Coordinates New York
coords_ny <-
  df_npn |>
  filter(Site_ID == 32789) |>
  filter(Species_ID == 228) |>
  distinct(lat, long)

coord_usa_ny <-
  coords_ny |>
  mutate(location = "New York City, NY (USA)") |>
  relocate(location, everything())

# Dataframe with coordinates to predict
df_coords_predict <-
  bind_rows(coord_kyoto,
            coord_liestal,
            coord_usa_wdc,
            coord_vancouver,
            coord_usa_ny) |>
  rename(location_name = location)

# Data frame with coordinates of interest
df_full_coords_predict <-
  df_full |>
  filter(lat %in% df_coords_predict$lat) |>
  left_join(df_coords_predict, by = c("lat", "long"))

# Data frame with meteorological information of interest for the coordinates
# to be predicted
df_weather_coords_predict <-
  df_weather |>
  filter(lat %in% df_full_coords_predict$lat |
           long %in% df_full_coords_predict$long) |>
  distinct(lat, long) |>
  left_join(df_full_coords_predict |>
              distinct(lat, long, location_name),
            by = c("lat", "long")) |>
  slice(-4) |>
  replace_na(list(location_name = "Liestal-Weideli (Swtizerland)"))

# Guarantee only one registration per year in each coordinate
data_bloom_year <-
  df_full_coords_predict |>
  group_by(lat, long, year) |>
  reframe(bloom_doy = max(bloom_doy, na.rm = TRUE),
          bloom_date_max = max(bloom_date, na.rm = TRUE),
          altitude = unique(alt),
          location_name = unique(location_name))

# ------ YEARLY CLIMATE ------------------------------

# Climate summary by year in coordinates to predict
data_summary_weather1 <-
  df_weather |>
  filter(lat %in% df_weather_coords_predict$lat) |>
  select(-c(MM, DD, DOY, FROST_DAYS)) |>
  pivot_longer(-c(lat, long, YEAR, YYYYMMDD)) |>
  group_by(lat, long, YEAR, name) |>
  reframe(
    MEAN = mean(value, na.rm = TRUE),
    MEDIAN = median(value, na.rm = TRUE),
    STD = sd(value, na.rm = TRUE),
    MIN = min(value, na.rm = TRUE),
    MAX = max(value, na.rm = TRUE),
  ) |>
  pivot_wider(names_from = name,
              values_from = -c(lat, long, YEAR, name)) |> 
  rename(year = YEAR)

data_summary_weather2 <-
  df_weather |>
  filter(lat %in% df_weather_coords_predict$lat) |>
  select(lat, long, YEAR, YYYYMMDD, FROST_DAYS) |>
  pivot_longer(-c(lat, long, YEAR, YYYYMMDD)) |>
  group_by(lat, long, YEAR, name) |>
  reframe(
    SUM = sum(value, na.rm = TRUE)
  ) |>
  pivot_wider(names_from = name,
              values_from = -c(lat, long, YEAR, name)) |> 
  rename(year = YEAR) |> 
  rename(TOTAL_FROST_DAYS = FROST_DAYS)

data_summary_weather <-
  left_join(data_summary_weather1, data_summary_weather2,
            by = c("lat", "long", "year"))

# Annual climate data
# Join by long 
data_weather_yearly <-
  left_join(data_bloom_year |> mutate(lat = as.character(lat),
                                      long = as.character(long)),
            data_summary_weather|> mutate(lat = as.character(lat),
                                      long = as.character(long)),
            by = c("long", "year")) |>
  filter(year > 1980) |> 
  rename(lat = lat.x) |> 
  select(-lat.y) |> 
  relocate(lat, long, everything()) |> 
  mutate(lat = as.numeric(lat),
         long = as.numeric(long))


# ------ MONTHLY CLIMATE ------------------------------

# Climate summary by year and month in coordinates to predict
data_summary_weather_monthly1 <-
  df_weather |>
  filter(lat %in% df_weather_coords_predict$lat) |>
  select(-c(DD, DOY, FROST_DAYS)) |>
  pivot_longer(-c(lat, long, YEAR, MM, YYYYMMDD)) |>
  group_by(lat, long, YEAR, MM, name) |>
  reframe(
    MEAN = mean(value, na.rm = TRUE),
    MEDIAN = median(value, na.rm = TRUE),
    STD = sd(value, na.rm = TRUE)
  ) |>
  pivot_wider(names_from = c(MM, name),
              values_from = -c(lat, long, YEAR, MM, name)) |>
  rename(year = YEAR)

data_summary_weather_monthly2 <-
  df_weather |>
  filter(lat %in% df_weather_coords_predict$lat) |>
  select(lat, long, YEAR, MM, YYYYMMDD, FROST_DAYS) |>
  group_by(lat, long, YEAR, MM) |>
  reframe(TOTAL_FROST_DAYS = sum(FROST_DAYS, na.rm = TRUE))  |>
  pivot_wider(names_from = c(MM),
              values_from = -c(lat, long, YEAR, MM)) |>
  set_names(c("lat", "long", "year", str_c("TOTAL_FROST_DAYS_M", 1:12)))

data_summary_weather_monthly <-
  left_join(data_summary_weather_monthly1, data_summary_weather_monthly2,
            by = c("lat", "long", "year"))

# Monthly climate data
# Join by long
data_weather_monthly <-
  left_join(
    data_bloom_year |> mutate(lat = as.character(lat),
                              long = as.character(long)),
    data_summary_weather_monthly |> mutate(lat = as.character(lat),
                                           long = as.character(long)),
    by = c("long", "year")
  ) |>
  filter(year > 1980) |>
  rename(lat = lat.x) |>
  select(-lat.y) |>
  relocate(lat, long, everything()) |>
  mutate(lat = as.numeric(lat),
         long = as.numeric(long))

# ------ PHOTOPERIOD  ------------------------------
data_photop_coords <-
  df_photoperiod |>
  filter(lat %in% df_coords_predict$lat) |>
  filter(year(date_photoperiod) > 1980) |> 
  mutate(year = year(date_photoperiod))

data_photop_coords_yearly <-
  data_photop_coords |>
  group_by(year, lat) |>
  reframe(
    mean_photop = mean(photoperiod, na.rm = TRUE),
    medianphotop = median(photoperiod, na.rm = TRUE),
    std_photop = sd(photoperiod, na.rm = TRUE)
  )

data_photop_coords_monthly <-
  data_photop_coords |>
  mutate(month_rec = month(date_photoperiod)) |>
  group_by(year, month_rec, lat) |>
  reframe(
    mean_photop = mean(photoperiod, na.rm = TRUE),
    median_photop = median(photoperiod, na.rm = TRUE),
    std_photop = sd(photoperiod, na.rm = TRUE)
  ) |>
  pivot_wider(
    names_from = month_rec,
    values_from = c(mean_photop, median_photop, std_photop)
  )

Generalities

  • I obtain a table that shows the range of dates for which there is a record for each location (coordinate), the number (n) of years for which there is a record of flowering is also added.
  • The record with the minimum flowering date is April 1, 812.
  • Some locations do not have updated information. The coordinate Lat: 46.52733 - Long: 8.932936 has 10 records from 1978 to 1987, being the coordinate with the “most outdated” record.
Code
df_full |> 
  group_by(location, lat, long) |> 
  reframe(min_date_doy = min(bloom_date, na.rm = TRUE),
          max_date_doy = max(bloom_date, na.rm = TRUE),
          n = n()) |> 
  arrange(min_date_doy) |> 
  mutate(across(where(is.numeric), ~ round(.x, digits = 4))) |> 
  customReactTable()

Distribution of days between flowering

  • To calculate the difference (in days) between flowering \(t_i\) with \(t_{i-1}\) I take into account coordinates that meet the following characteristics:
    • Minimum 2 records \((n > 1)\)
    • Difference between years equal to 1, that is, I only keep coordinates whose blooms have been reported with an annual frequency, it does not matter if it was 20 years ago, the important thing is that there is information year after year.
  • If we call \(\gamma\) the bloom date (bloom_date) that has measurements in a year \(t\), with \(t = 1, 2, ..., k\), where \(k\) is the number of records that meet the two restrictions described previously, the distribution shown in the following graphs is the difference of \(\gamma_t\) with \(\gamma_{t-1}\), that is, the lag operator with \(p = 1\) on the difference in days between flowerings. It is important to mention that the user will be able to choose the lag value to generate \((\gamma _{t-p})\), in which case for the first condition \(n\) is expressed as \(n = p + 1\); The second condition remains the same regardless of the value of \(p\).
Code
n_lag_operator <- 1

res_japan <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[1],
    pal_colors = colors_custom
  )

res_kyoto <-
  lagBloomDate(
    data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
    n_lag = n_lag_operator,
    country_sel = countries[1],
    pal_colors = colors_custom
  )

res_switzerland <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[2],
    pal_colors = colors_custom
  )

res_southk <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[3],
    pal_colors = colors_custom
  )

res_usa_wdc <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[4],
    pal_colors = colors_custom
  )

res_usa_ny <-
  lagBloomDate(
    data = df_full |>
      filter(lat %in% coords_ny$lat & long %in% coords_ny$long),
    n_lag = n_lag_operator,
    country_sel = NULL,
    pal_colors = colors_custom
  )

res_japan$plot_diff
res_kyoto$plot_diff +
  labs(subtitle = glue("Kyoto: 2 coordinates with p = {n_lag_operator}"))
res_switzerland$plot_diff
res_southk$plot_diff
res_usa_wdc$plot_diff
res_usa_ny$plot_diff

Code
n_lag_operator <- 5

res_japan <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[1],
    pal_colors = colors_custom
  )

res_kyoto <-
  lagBloomDate(
    data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
    n_lag = n_lag_operator,
    country_sel = countries[1],
    pal_colors = colors_custom
  )

res_switzerland <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[2],
    pal_colors = colors_custom
  )

res_southk <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[3],
    pal_colors = colors_custom
  )

res_usa_wdc <-
  lagBloomDate(
    data = df_full,
    n_lag = n_lag_operator,
    country_sel = countries[4],
    pal_colors = colors_custom
  )

res_japan$plot_diff
res_kyoto$plot_diff +
  labs(subtitle = glue("Kyoto: 2 coordinates with p = {n_lag_operator}"))
res_switzerland$plot_diff
res_southk$plot_diff
res_usa_wdc$plot_diff

 

 

DOY lagged

  • In the following graphs the response variable in year \(t\) is related to \(t-i\). Unlike the previous graphs in these I use the variable bloom_doy and evaluate a decade, that is, \(i = 1, 2, ..., 10\).
Code
# Japan
lagsDOY(data = df_full,
        country_sel = countries[1],
        pal_colors = colors_custom)$plot_lags
# Kyoto
lagsDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
        country_sel = countries[1],
        pal_colors = colors_custom)$plot_lags +
  labs(subtitle = "Kyoto")
# Switzerland
lagsDOY(data = df_full,
        country_sel = countries[2],
        pal_colors = colors_custom)$plot_lags
# South Korea
lagsDOY(data = df_full,
        country_sel = countries[3],
        pal_colors = colors_custom)$plot_lags
# USA-WDC
lagsDOY(data = df_full,
        country_sel = countries[4],
        pal_colors = colors_custom)$plot_lags

Autocorrelation function (ACF - PACF)

  • For these charts I get the maximum value of bloom_doy per year.
Code
# Japan
plotACFMaxDOY(data = df_full, country_sel = countries[1])$plot1
plotACFMaxDOY(data = df_full, country_sel = countries[1])$plot2
plotACFMaxDOY(data = df_full, country_sel = countries[1])$plot3
# Kyoto
plotACFMaxDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
              country_sel = countries[1])$plot1 +
  labs(subtitle = "Kyoto (all data)")
plotACFMaxDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
              country_sel = countries[1])$plot2 +
  labs(subtitle = "Kyoto (all data)")
plotACFMaxDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
              country_sel = countries[1])$plot3 +
  labs(subtitle = "Kyoto (50 years)")
# Switzerland
plotACFMaxDOY(data = df_full, country_sel = countries[2])$plot1
plotACFMaxDOY(data = df_full, country_sel = countries[2])$plot2
plotACFMaxDOY(data = df_full, country_sel = countries[2])$plot3
# South Korea
plotACFMaxDOY(data = df_full, country_sel = countries[3])$plot1
plotACFMaxDOY(data = df_full, country_sel = countries[3])$plot2
plotACFMaxDOY(data = df_full, country_sel = countries[3])$plot3
# USA-WDC
plotACFMaxDOY(data = df_full, country_sel = countries[4])$plot1
plotACFMaxDOY(data = df_full, country_sel = countries[4])$plot2
plotACFMaxDOY(data = df_full, country_sel = countries[4])$plot3

Doy time series

  • Given that the historical record for the coordinates is different and that very few have old records, I decide to use data from the year 1950.
  • In the graph of Japan there are some coordinates with bloom_doy values less than 60, these 10 places are listed below:
    • Japan/Naze
    • Japan/Naze/Funchatoge
    • Japan/Yonagunijima
    • Japan/Iriomotejima
    • Japan/Ishigakijima
    • Japan/Miyakojima
    • Japan/Kumejima
    • Japan/Naha
    • Japan/Nago
    • Japan/Minamidaitojima
  • The red line of the mita graph is the average trend obtained through a GAM model with five degrees of freedom.
Code
timeSerieDOY(data = df_full,
             country_sel = countries[1],
             pal_colors = colors_custom)$plot1
timeSerieDOY(data = df_full |> filter(bloom_doy > 60),
             country_sel = countries[1],
             pal_colors = colors_custom)$plot2 +
  labs(subtitle = "Japan with DOY > 60")
timeSerieDOY(data = df_full |> filter(bloom_doy > 60),
             country_sel = countries[1],
             pal_colors = colors_custom)$plot3 +
  labs(subtitle = "Japan with DOY > 60")

Code
timeSerieDOY(
  data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
  country_sel = countries[1],
  pal_colors = colors_custom
)$plot1  +
  labs(subtitle = "Kyoto")
timeSerieDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
             country_sel = countries[1],
             pal_colors = colors_custom)$plot2 +
  labs(subtitle = "Kyoto")
timeSerieDOY(data = df_full |> filter(str_detect(location, "kyoto|Kyoto|KYOTO")),
             country_sel = countries[1],
             pal_colors = colors_custom)$plot3 +
  labs(subtitle = "Kyoto")

Code
timeSerieDOY(data = df_full,
             country_sel = countries[2],
             pal_colors = colors_custom)$plot1
timeSerieDOY(data = df_full,
             country_sel = countries[2],
             pal_colors = colors_custom)$plot2
timeSerieDOY(data = df_full,
             country_sel = countries[2],
             pal_colors = colors_custom)$plot3

Code
timeSerieDOY(data = df_full,
             country_sel = countries[3],
             pal_colors = colors_custom)$plot1
timeSerieDOY(data = df_full,
             country_sel = countries[3],
             pal_colors = colors_custom)$plot2
timeSerieDOY(data = df_full,
             country_sel = countries[3],
             pal_colors = colors_custom)$plot3

Code
timeSerieDOY(data = df_full,
             country_sel = countries[4],
             pal_colors = colors_custom)$plot1
timeSerieDOY(data = df_full,
             country_sel = countries[4],
             pal_colors = colors_custom)$plot2
timeSerieDOY(data = df_full,
             country_sel = countries[4],
             pal_colors = colors_custom)$plot3

Code
timeSerieDOY(data = df_full |>
      filter(lat %in% coords_ny$lat & long %in% coords_ny$long),
             country_sel = NULL,
             pal_colors = colors_custom)$plot1
timeSerieDOY(data = df_full |>
      filter(lat %in% coords_ny$lat & long %in% coords_ny$long),
             country_sel = NULL,
             pal_colors = colors_custom)$plot2
timeSerieDOY(data = df_full |>
      filter(lat %in% coords_ny$lat & long %in% coords_ny$long),
             country_sel = NULL,
             pal_colors = colors_custom)$plot3

Time series: coordinates of interest

  • I extract only the coordinates of interest to the competition.

Code
g1 <-
  df_full_coords_predict |>
  ggplot(aes(x = year, y = bloom_doy)) +
  facet_wrap( ~ location_name, scales = "free") +
  geom_line(color = colors_custom[1],
            alpha = 0.65) +
  geom_smooth(
    method = "gam",
    formula = y ~ ns(x, df = 5),
    color = colors_custom[2],
    se = TRUE,
    size = 0.7
  ) +
  labs(
    x = "Year",
    y = "DOY",
    title = "DOY time series",
    subtitle = "Original variable - Historical"
  )

g2 <-
  df_full_coords_predict |>
  filter(year > 1950) |>
  group_by(location_name, lat, long) |>
  mutate(bloom_doy_stand = scale(bloom_doy, scale = TRUE, center = TRUE)) |>
  ungroup() |>
  ggplot(aes(x = year, y = bloom_doy_stand, color = location_name)) +
  geom_point(size = 0.65) +
  geom_line(size = 0.5, alpha = 0.5) +
  geom_hline(
    yintercept = 0,
    lty = 2,
    color = "gray15",
    size = 0.2
  ) +
  geom_smooth(
    method = "gam",
    formula = y ~ ns(x, df = 5),
    color = "gray15",
    se = TRUE,
    size = 0.7
  ) +
  labs(
    x = "Year",
    y = latex2exp::TeX(r'($DOY\ \left(\frac{x_i - \mu}{\sigma}\right)$)'),
    title = "DOY time series",
    subtitle = "Standardized variable - since 1950",
    color = ""
  ) +
  scale_color_manual(values = colors_custom) +
  guides(color = guide_legend(nrow = 2))


g1
g2

Code
ggplotly(g2 + labs(y = "Standardized DOY", 
                   title = "Standardized variable - since 1950")) |>
  layout(
    legend = list(
      orientation = "h",
      xanchor = "center",
      y = -0.2,
      x = 0.5,
      traceorder = "normal",
      font = list(size = 12),
      bgcolor = "white",
      bordercolor = "#444",
      borderwidth = 1
    )
  )

Yearly weather vs DOY

  • The meteorological data were obtained with R from the Prediction of Worldwide Energy Resources (POWER) project, using the nasapower for extraction of time series (from 1981-01-01 to 2024-02-25) with daily frequency of the following variables:
    • Average Earth Surface Temperature (TS)
    • Maximum temperature of the earth’s surface (TS_MAX)
    • Minimum land surface temperature (TS_MIN)
    • Evaporation from the earth’s surface (EVLAND)
    • Frost days (FROST_DAYS)
    • Precipitation (PRECTOTCORR)
    • Soil moisture profile (GWETPROF)
    • Relative humidity at two meters (RH2M)
    • Soil moisture in the root zone (GWETROOT)
    • The total amount of ozone in a column extending vertically from the Earth’s surface to the top of the atmosphere (TO3)
  • To see the description of each parameter, see POWER resources.
  • As the POWER project is aimed at three user communities, in this case the Agroclimatology community (AG) is used.
  • For these graphs I filter out records that are not from New York since the DOY is very high compared to the other four sites.
  • It is important to mention that these graphs are built with the data summarized by year (data_weather_yearly), however, later the relationship of the climate summarized on a monthly basis (data_weather_monthly) is evaluated.

General

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 1,
                     loc_name = NULL,
                     pal_colors = colors_custom)$plot1
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 1,
                     loc_name = NULL,
                     pal_colors = colors_custom)$plot2

By location with p=1

  • Charts are shown only for locations that have sufficient information.
Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 1,
                     loc_name = df_coords_predict$location_name[1],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 1,
                     loc_name = df_coords_predict$location_name[2],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 5,
                     loc_name = df_coords_predict$location_name[3],
                     pal_colors = colors_custom)

By location with p=5

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 5,
                     loc_name = df_coords_predict$location_name[1],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 5,
                     loc_name = df_coords_predict$location_name[2],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 5,
                     loc_name = df_coords_predict$location_name[3],
                     pal_colors = colors_custom)

By location with p=10

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 10,
                     loc_name = df_coords_predict$location_name[1],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 10,
                     loc_name = df_coords_predict$location_name[2],
                     pal_colors = colors_custom)

Code
scatterYearlyClimate(data = data_weather_yearly,
                     list_coords = df_coords_predict$lat,
                     n_lag = 10,
                     loc_name = df_coords_predict$location_name[3],
                     pal_colors = colors_custom)

Monthly weather vs DOY

  • These graphs show the correlation profile with the 50 variables with the greatest association (positive or negative).
Code
corrMonthlyClimate(
  data = data_weather_monthly,
  list_coords = df_coords_predict$lat,
  n_lag = 1,
  loc_name = df_coords_predict$location_name[1],
  pal_colors = colors_custom
)$plot1
corrMonthlyClimate(
  data = data_weather_monthly,
  list_coords = df_coords_predict$lat,
  n_lag = 1,
  loc_name = df_coords_predict$location_name[2],
  pal_colors = colors_custom
)$plot1
corrMonthlyClimate(
  data = data_weather_monthly,
  list_coords = df_coords_predict$lat,
  n_lag = 1,
  loc_name = df_coords_predict$location_name[3],
  pal_colors = colors_custom
)$plot1
corrMonthlyClimate(
  data = data_weather_monthly,
  list_coords = df_coords_predict$lat,
  n_lag = 1,
  loc_name = NULL,
  pal_colors = colors_custom
)$plot1